class: center, middle, inverse, title-slide

.title[
# Distributions and Central Tendency
]
.subtitle[
## EDP 613
]
.author[
### Week 4
]

---

<script>
  function resizeIframe(obj) {
    obj.style.height = obj.contentWindow.document.body.scrollHeight + 'px';
  }
</script>

# Prepping a New R Script

1. Open up a blank R script using the menu path **File > New File > R Script**.

--

2. Save this script as `whatever.R` (replacing the term `whatever`) in your R folder. Remember to note where the file is!

--

3. After you have saved this file as `whatever.R`, go to the menu and select **Session > Set Working Directory > To Source File Location**.

---

# Getting ready for this session

>- Get the file `teampolview.csv` and save it in the same location as this script.

>- Install the package `pacman`. Remember you can download it using **Tools > Install Packages** and typing in the name. Please make sure the **Install Dependencies** option has a checkmark beside it. The install may take a minute.

<center>
<img src="install-packages.png" width="343" height="252" alt="Install Packages dialog">
</center>

---

>- `pacman` will automatically install a package if you don't have it and load it up for you.

```r
pacman::p_load(tidyverse)
```

---

# Use the Pipe

- Here's what it looks like: `%>%`.

--

- In RStudio, you can take a shortcut:
  - For Windows: <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>M</kbd>
  - For Macs: <kbd>Cmd</kbd>+<kbd>Shift</kbd>+<kbd>M</kbd>

---

# Basic Logic

```r
# Illustration only: read %>% as "and then". This is not runnable R.
"get up in the morning" %>%
  "drink a lot of coffee" %>%
  "come to work" %>%
  "do stuff" %>%
  "go home" %>%
  "eat" %>%
  "sleep (maybe)"
```

- Works like layers: each line passes its result to the next
- You can highlight part of it and run just that part

---

# Example

- Use the default `starwars` data set

--

- Type in `starwars`

--

count: false

.panel1-sw1-auto[
```r
*starwars
```
]

.panel2-sw1-auto[
```
# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
```
]

---

count: false

.panel1-sw1-auto[
```r
starwars %>%
* select(name, species, homeworld)
```
]

.panel2-sw1-auto[
```
# A tibble: 87 × 3
   name               species homeworld
   <chr>              <chr>   <chr>
 1 Luke Skywalker     Human   Tatooine
 2 C-3PO              Droid   Tatooine
 3 R2-D2              Droid   Naboo
 4 Darth Vader        Human   Tatooine
 5 Leia Organa        Human   Alderaan
 6 Owen Lars          Human   Tatooine
 7 Beru Whitesun lars Human   Tatooine
 8 R5-D4              Droid   Tatooine
 9 Biggs Darklighter  Human   Tatooine
10 Obi-Wan Kenobi     Human   Stewjon
# … with 77 more rows
```
]

---

count: false

.panel1-sw1-auto[
```r
starwars %>%
  select(name, species, homeworld) %>%
* head()
```
]

.panel2-sw1-auto[
```
# A tibble: 6 × 3
  name           species homeworld
  <chr>          <chr>   <chr>
1 Luke Skywalker Human   Tatooine
2 C-3PO          Droid   Tatooine
3 R2-D2          Droid   Naboo
4 Darth Vader    Human   Tatooine
5 Leia Organa    Human   Alderaan
6 Owen Lars      Human   Tatooine
```
]

<style>
.panel1-sw1-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw1-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw1-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

# Run a smaller chunk

Highlight the first two lines and run them.

--

<center>
<img src="runpart_run.gif" width="500" alt="Run Part of Chunk">
</center>

---

# Output

<center>
<img src="runpart_out.gif" width="500" alt="Run Part of Chunk">
</center>

---

# Now on to Descriptives

---

# Frequency distributions

- A *frequency distribution* tells us how many observations there are at different values of a variable.

--

- You could count manually...but why?

--

- We can have R do the work for us using a *frequency table*

---

## Single variable counts

count: false

.panel1-sw2-auto[
```r
*starwars
```
]

.panel2-sw2-auto[
```
# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
```
]

---

count: false

.panel1-sw2-auto[
```r
starwars %>%
* select(name, species, homeworld)
```
]

.panel2-sw2-auto[
```
# A tibble: 87 × 3
   name               species homeworld
   <chr>              <chr>   <chr>
 1 Luke Skywalker     Human   Tatooine
 2 C-3PO              Droid   Tatooine
 3 R2-D2              Droid   Naboo
 4 Darth Vader        Human   Tatooine
 5 Leia Organa        Human   Alderaan
 6 Owen Lars          Human   Tatooine
 7 Beru Whitesun lars Human   Tatooine
 8 R5-D4              Droid   Tatooine
 9 Biggs Darklighter  Human   Tatooine
10 Obi-Wan Kenobi     Human   Stewjon
# … with 77 more rows
```
]

---

count: false

.panel1-sw2-auto[
```r
starwars %>%
  select(name, species, homeworld) %>%
* count(species)
```
]

.panel2-sw2-auto[
```
# A tibble: 38 × 2
   species       n
   <chr>     <int>
 1 Aleena        1
 2 Besalisk      1
 3 Cerean        1
 4 Chagrian      1
 5 Clawdite      1
 6 Droid         6
 7 Dug           1
 8 Ewok          1
 9 Geonosian     1
10 Gungan        3
# … with 28 more rows
```
]

<style>
.panel1-sw2-auto { color: white; width:
38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw2-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw2-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

## Multiple variable counts

count: false

.panel1-sw3-auto[
```r
*starwars
```
]

.panel2-sw3-auto[
```
# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
```
]

---

count: false

.panel1-sw3-auto[
```r
starwars %>%
* select(name, species, homeworld)
```
]

.panel2-sw3-auto[
```
# A tibble: 87 × 3
   name               species homeworld
   <chr>              <chr>   <chr>
 1 Luke Skywalker     Human   Tatooine
 2 C-3PO              Droid   Tatooine
 3 R2-D2              Droid   Naboo
 4 Darth Vader        Human   Tatooine
 5 Leia Organa        Human   Alderaan
 6 Owen Lars          Human   Tatooine
 7 Beru Whitesun lars Human   Tatooine
 8 R5-D4              Droid   Tatooine
 9 Biggs Darklighter  Human   Tatooine
10 Obi-Wan Kenobi     Human   Stewjon
# … with 77 more rows
```
]

---

count: false

.panel1-sw3-auto[
```r
starwars %>%
  select(name, species, homeworld) %>%
* count(species, homeworld)
```
]

.panel2-sw3-auto[
```
# A tibble: 58 × 3
   species  homeworld       n
   <chr>    <chr>       <int>
 1 Aleena   Aleen Minor     1
 2 Besalisk Ojom            1
 3 Cerean   Cerea           1
 4 Chagrian Champala        1
 5 Clawdite Zolan           1
 6 Droid    Naboo           1
 7 Droid    Tatooine        2
 8 Droid    <NA>            3
 9 Dug      Malastare       1
10 Ewok     Endor           1
# … with 48 more rows
```
]

<style>
.panel1-sw3-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw3-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw3-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

--

Better, but a large table is difficult to picture...

---

# Arranging Data

We can arrange the data set.

count: false

.panel1-sw4-auto[
```r
*starwars
```
]

.panel2-sw4-auto[
```
# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
```
]

---

count: false

.panel1-sw4-auto[
```r
starwars %>%
* select(name, species, homeworld)
```
]

.panel2-sw4-auto[
```
# A tibble: 87 × 3
   name               species homeworld
   <chr>              <chr>   <chr>
 1 Luke Skywalker     Human   Tatooine
 2
C-3PO Droid Tatooine
 3 R2-D2              Droid   Naboo
 4 Darth Vader        Human   Tatooine
 5 Leia Organa        Human   Alderaan
 6 Owen Lars          Human   Tatooine
 7 Beru Whitesun lars Human   Tatooine
 8 R5-D4              Droid   Tatooine
 9 Biggs Darklighter  Human   Tatooine
10 Obi-Wan Kenobi     Human   Stewjon
# … with 77 more rows
```
]

---

count: false

.panel1-sw4-auto[
```r
starwars %>%
  select(name, species, homeworld) %>%
* count(species, homeworld)
```
]

.panel2-sw4-auto[
```
# A tibble: 58 × 3
   species  homeworld       n
   <chr>    <chr>       <int>
 1 Aleena   Aleen Minor     1
 2 Besalisk Ojom            1
 3 Cerean   Cerea           1
 4 Chagrian Champala        1
 5 Clawdite Zolan           1
 6 Droid    Naboo           1
 7 Droid    Tatooine        2
 8 Droid    <NA>            3
 9 Dug      Malastare       1
10 Ewok     Endor           1
# … with 48 more rows
```
]

---

count: false

.panel1-sw4-auto[
```r
starwars %>%
  select(name, species, homeworld) %>%
  count(species, homeworld) %>%
* arrange(-n)
```
]

.panel2-sw4-auto[
```
# A tibble: 58 × 3
   species  homeworld     n
   <chr>    <chr>     <int>
 1 Human    Tatooine      8
 2 Human    Naboo         5
 3 Human    <NA>          5
 4 Droid    <NA>          3
 5 Gungan   Naboo         3
 6 Human    Alderaan      3
 7 Droid    Tatooine      2
 8 Human    Corellia      2
 9 Human    Coruscant     2
10 Kaminoan Kamino        2
# … with 48 more rows
```
]

<style>
.panel1-sw4-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw4-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw4-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

--

Well, that's better, but nothing really beats a picture, so...

---

# Let's Make a Bar Plot

1.
Assign the data to a variable

```r
sw_counts <- starwars %>%
  select(name, species, homeworld) %>%
  count(species)
```

---

<span>2.</span> Set up the visual using `ggplot()`

count: false

.panel1-sw5-auto[
```r
*ggplot(data = sw_counts,
*       aes(x = species, y = n))
```
]

.panel2-sw5-auto[
![](Slides-Week-4R_files/figure-html/sw5_auto_01_output-1.png)<!-- -->
]

---

count: false

.panel1-sw5-auto[
```r
ggplot(data = sw_counts,
       aes(x = species, y = n)) +
* geom_bar(stat = "identity")
```
]

.panel2-sw5-auto[
![](Slides-Week-4R_files/figure-html/sw5_auto_02_output-1.png)<!-- -->
]

<style>
.panel1-sw5-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw5-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw5-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

--

<br>

- Well, that looks terrible

--

- Maybe we can just look at a few of them

---

# Filtering Data

```r
sw_filtered <- starwars %>%
  select(name, species, homeworld) %>%
  count(species, homeworld) %>%
  filter(species %in% c("Human", "Droid", "Gungan"))
```

---

count: false

.panel1-sw6-auto[
```r
*ggplot(data = sw_filtered,
*       aes(
*         x = species,
*         y = n
*         )
*       )
```
]

.panel2-sw6-auto[
![](Slides-Week-4R_files/figure-html/sw6_auto_01_output-1.png)<!-- -->
]

---

count: false

.panel1-sw6-auto[
```r
ggplot(data = sw_filtered,
       aes(
         x = species,
         y = n
         )
       ) +
* geom_bar(stat = "identity")
```
]

.panel2-sw6-auto[
![](Slides-Week-4R_files/figure-html/sw6_auto_02_output-1.png)<!-- -->
]

<style>
.panel1-sw6-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw6-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw6-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

## Numerical

count: false

.panel1-sw7-auto[
```r
*ggplot(data = sw_filtered,
*       aes(
*         x = species,
*         y = n,
*         fill = n
*         )
*       )
```
]

.panel2-sw7-auto[
![](Slides-Week-4R_files/figure-html/sw7_auto_01_output-1.png)<!-- -->
]

---

count: false

.panel1-sw7-auto[
```r
ggplot(data = sw_filtered,
       aes(
         x = species,
         y = n,
         fill = n
         )
       ) +
* geom_bar(stat = "identity")
```
]

.panel2-sw7-auto[
![](Slides-Week-4R_files/figure-html/sw7_auto_02_output-1.png)<!-- -->
]

<style>
.panel1-sw7-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw7-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw7-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

## Categorical

count: false

.panel1-sw8-auto[
```r
*ggplot(data = sw_filtered,
*       aes(
*         x = species,
*         y = n,
*         fill = species
*         )
*       )
```
]

.panel2-sw8-auto[
![](Slides-Week-4R_files/figure-html/sw8_auto_01_output-1.png)<!-- -->
]

---

count: false

.panel1-sw8-auto[
```r
ggplot(data = sw_filtered,
       aes(
         x = species,
         y = n,
         fill = species
         )
       ) +
* geom_bar(stat = "identity")
```
]

.panel2-sw8-auto[
![](Slides-Week-4R_files/figure-html/sw8_auto_02_output-1.png)<!-- -->
]

<style>
.panel1-sw8-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw8-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw8-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

## Both!

count: false

.panel1-sw9-auto[
```r
*ggplot(data = sw_filtered,
*       aes(
*         x = species,
*         y = n,
*         fill = n
*         )
*       )
```
]

.panel2-sw9-auto[
![](Slides-Week-4R_files/figure-html/sw9_auto_01_output-1.png)<!-- -->
]

---

count: false

.panel1-sw9-auto[
```r
ggplot(data = sw_filtered,
       aes(
         x = species,
         y = n,
         fill = n
         )
       ) +
* geom_bar(stat = "identity")
```
]

.panel2-sw9-auto[
![](Slides-Week-4R_files/figure-html/sw9_auto_02_output-1.png)<!-- -->
]

---

count: false

.panel1-sw9-auto[
```r
ggplot(data = sw_filtered,
       aes(
         x = species,
         y = n,
         fill = n
         )
       ) +
  geom_bar(stat = "identity") +
* facet_wrap(n ~ .)
```
]

.panel2-sw9-auto[
![](Slides-Week-4R_files/figure-html/sw9_auto_03_output-1.png)<!-- -->
]

<style>
.panel1-sw9-auto { color: white; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel2-sw9-auto { color: white; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% }
.panel3-sw9-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% }
</style>

---

## Loading up local data

To explore this, let's first load the 2012 voter fraud file and assign it to a variable. We can do this using the `read_csv()` function from the `readr` package, which is part of `tidyverse`.

```r
voter_fraud <- read_csv("2012_Voter_Fraud.csv")
```

---

## Side note

Base R has its own `read.csv()`, which can be a royal pain if you don't know what you're doing. It's strongly advised that you stick with the tidy way of loading data. Remember:

>- `read_csv` with a `_` is tidy

>- `read.csv` with a `.` is messy

---

# Measures of Central Tendency

To see how we find the mean, median, and mode, let's use our original data set and first look at the `total` column, which has the raw counts.

```r
voter_fraud %>%
  select(total)
```

```
# A tibble: 50 × 1
   total
   <dbl>
 1     6
 2    11
 3     1
 4    24
 5    20
 6    21
 7     3
 8     9
 9    48
10    20
# … with 40 more rows
```

---

For the mean, we use

```r
voter_fraud %>%
  summarize(Average = mean(total))
```

```
# A tibble: 1 × 1
  Average
    <dbl>
1    13.3
```

---

For the median, we use

```r
voter_fraud %>%
  summarize(Average = median(total))
```

```
# A tibble: 1 × 1
  Average
    <dbl>
1      11
```

---

For the mode, we use

```r
voter_fraud %>%
  summarize(Average = mode(total))
```

```
# A tibble: 1 × 1
  Average
  <chr>
1 numeric
```

`mode` still doesn't work! In base R, `mode()` reports how the values are stored (here, `numeric`), not the most frequent value.

---

# A Mode You Can Use

```r
Mode <- function(x) {
  ux <- unique(x)                         # the distinct values in x
  ux[which.max(tabulate(match(x, ux)))]   # the distinct value that occurs most often
}
# Notice that 'Mode' is capitalized so that R won't confuse it
# with its internal command 'mode'.
```

---

```r
voter_fraud %>%
  summarize(Average = Mode(total))
```

```
# A tibble: 1 × 1
  Average
    <dbl>
1       4
```

---

# On Your Own

This is your chance to get some practice in and to ask questions. You won't get the opportunity to get help during quizzes and exams, so take advantage now!

Open up a new script and load the `Box Office.csv` data set in R. This set was scraped from Rotten Tomatoes prior to Avengers: Endgame becoming the highest-grossing movie of all time.

Now try answering the following questions using R:

1. What is the average number of positive reviews for the top five movies?
2. What is the average number of negative reviews for the bottom five movies?
3. How were movies released over the years? Provide counts and a visualization.
4. Which measure of central tendency best describes the average number of movies over the years?
5. Which year has the most ranked movies?

I'll post the solutions next week!

---

## That's it for today!
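
---

# Appendix: Checking `Mode()` by Hand

If you want to convince yourself that the capitalized `Mode()` helper from the "A Mode You Can Use" slide behaves as described, here is a minimal sketch on a tiny vector you can verify by eye. The vector `toy_totals` is made up for illustration and is not taken from the voter fraud file.

```r
# Same helper as on the "A Mode You Can Use" slide
Mode <- function(x) {
  ux <- unique(x)                         # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]   # distinct value with the largest count
}

# Hypothetical data: 4 appears three times, 11 twice, 20 once
toy_totals <- c(4, 11, 4, 20, 11, 4)

mean(toy_totals)     # 9
median(toy_totals)   # 7.5
Mode(toy_totals)     # 4
mode(toy_totals)     # "numeric" (base R's mode() describes storage, not frequency)
```

One thing to keep in mind: because `which.max()` returns the first maximum, this helper reports only one value when two or more values are tied for most frequent (the one that shows up first in the data).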